Scaled Log Likelihood Ratios for the Detection of Abbreviations in Text Corpora
Authors
Abstract
We describe a language-independent, flexible, and accurate method for detecting abbreviations in text corpora. It is based on the idea that an abbreviation can be viewed as a collocation and can therefore be identified with collocation-detection methods such as the log likelihood ratio. Although the log likelihood ratio is known to show good recall, its precision is poor. We employ scaling factors that strongly improve precision. Experiments with English and German corpora show that abbreviations can be detected with high accuracy.
Similar resources
Viewing sentence boundary detection as collocation identification
The detection of abbreviations is an important step in the process of sentence boundary detection. We describe a flexible, language-independent and accurate method based on the idea that an abbreviation can be viewed as a collocation. As such, it can be identified by using methods for collocation detection such as the log likelihood ratio. Although the log likelihood ratio is known to show a goo...
Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents
Beyond its own merits, text classification (TC) has become a cornerstone of many applications. The work presented here is part of, and a prerequisite for, a project we have undertaken to create a corpus for Arabic text processing. It is an attempt to automatically create modules that would help speed up the process of classification for any text categorization task. It also serves as a tool for...
Hedges in English for Academic Purposes: A Corpus-based study of Iranian EFL learners
Hedges, as tools to express tentativeness and doubt, have been studied in plenty of research papers in the Iranian EFL research setting. However, their use in a learner corpus, portraying Iranian learner English, is in need of more research attention. With this end in view, this study aimed at investigating how Iranian EFL learners who have majored in English-related fields in Iran deployed hed...
Extending the Cochran rule for the comparison of word frequencies between corpora
We first describe a number of inter-related issues that need to be considered by the researcher when comparing frequencies of linguistic features in two or more corpora. We then describe the chi-squared and log-likelihood tests used in previous research for the comparison of word frequencies. Our focus, in this paper, is on the issue of reliability of the statistical tests, and we describe simu...
Asymptotic normality and analysis of variance of log-likelihood ratios in spiked random matrix models
The present manuscript studies signal detection by likelihood ratio tests in a number of spiked random matrix models, including but not limited to Gaussian mixtures and spiked Wishart covariance matrices. We work directly with multi-spiked cases in these models and with flexible priors on the signal component that allow dependence across spikes. We derive asymptotic normality for the log-likeli...